Introduction to Triton Programming: Beyond Basic Elementwise Operations: Moving to Tile-Based Matrix Operations

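The previous lesson covered elementwise operations (for example, a basic ReLU applied to a matrix). Such operations are memory-bound: the GPU spends more time moving data from high-bandwidth memory (HBM) into registers than it spends computing on that data.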
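To put a number on "memory-bound", the sketch below estimates the arithmetic intensity of an elementwise ReLU. It assumes 4-byte (fp32) elements and counts one load and one store per element; the function name is illustrative and not part of Triton.

```python
def relu_arithmetic_intensity(num_elements: int, bytes_per_element: int = 4) -> float:
    """Operations per byte of memory traffic for an elementwise ReLU (rough model)."""
    ops = num_elements                                   # one max(x, 0) per element
    bytes_moved = 2 * num_elements * bytes_per_element   # one load plus one store
    return ops / bytes_moved


intensity = relu_arithmetic_intensity(4096 * 4096)
print(f"ReLU: {intensity:.3f} ops per byte")  # ReLU: 0.125 ops per byte
```

Under these assumptions the ratio does not depend on the matrix size: every element is loaded, touched once, and written back, so the arithmetic units mostly wait on memory.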

1. The Centrality of GEMM

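General matrix multiplication (GEMM) has a computational complexity of $O(N^3)$, while its memory access requirement is only $O(N^2)$. This gap lets the GPU's large arithmetic throughput hide memory latency, which is what makes GEMM the 'heart' of large language models (LLMs).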
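The $O(N^3)$ versus $O(N^2)$ gap can be made concrete with a rough estimate. The sketch below assumes square fp32 matrices and an idealized kernel that reads A and B and writes C exactly once each, which presumes perfect reuse in on-chip memory; real kernels move more data than this lower bound.

```python
def gemm_arithmetic_intensity(n: int, bytes_per_element: int = 4) -> float:
    """FLOPs per byte for C = A @ B with n x n matrices (idealized traffic)."""
    flops = 2 * n**3                             # n^3 multiply-adds, 2 FLOPs each
    bytes_moved = 3 * n**2 * bytes_per_element   # read A, read B, write C once each
    return flops / bytes_moved


for n in (256, 1024, 4096):
    print(n, round(gemm_arithmetic_intensity(n), 1))
```

In this model the intensity is $N/6$ FLOPs per byte, so it grows linearly with the matrix size. That growth is the headroom that tiling tries to exploit.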

2. 2D Memory Representation

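Physical RAM is one-dimensional. To represent a 2D tensor, we use strides: the number of elements to step in memory to advance by one row or by one column. A common mistake in production kernels (the 'Stride Trap') is to assume that the tensor is contiguous. If the row and column strides are mixed up in the pointer arithmetic, the kernel can read data that lies outside the tensor or trigger a memory access violation.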
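A small pure-Python model of stride-based addressing shows both failure modes. A flat list stands in for RAM and `element_offset` is a hypothetical helper, not a Triton or PyTorch function; the transposed view is non-contiguous because it shares the buffer with swapped strides.

```python
def element_offset(row, col, stride_row, stride_col):
    """Flat offset (in elements) of entry (row, col) for the given strides."""
    return row * stride_row + col * stride_col


# A 2 x 3 row-major matrix M laid out in 1D "memory".
memory = [10, 11, 12,
          20, 21, 22]
stride_row, stride_col = 3, 1

m_12 = memory[element_offset(1, 2, stride_row, stride_col)]      # M[1][2]

# Transposed view T: shape 3 x 2, same buffer, strides swapped to (1, 3).
t_stride_row, t_stride_col = stride_col, stride_row
t_10 = memory[element_offset(1, 0, t_stride_row, t_stride_col)]  # T[1][0] == M[0][1]

# Trap 1: assume the 3 x 2 view is contiguous, i.e. strides (2, 1).
assumed_contiguous = memory[element_offset(1, 0, 2, 1)]

# Trap 2: mix up the row and column strides of the view.
confused_offset = element_offset(2, 1, t_stride_col, t_stride_row)
out_of_bounds = confused_offset >= len(memory)

print(m_12, t_10, assumed_contiguous, confused_offset, out_of_bounds)
# 22 11 12 7 True
```

The contiguity assumption silently returns 12 instead of 11, while the swapped strides produce offset 7 in a 6-element buffer. On a GPU the first case corrupts the output and the second is an out-of-bounds access.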

3. Generalizing to Tiles

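Triton generalizes the elementwise logic by moving from a single pointer to a block of pointers. By working on 2D tiles (for example, $16 \times 16$), we exploit data reuse inside fast SRAM and keep the data 'hot', so that fused operations such as a bias addition or an activation function can run before the result is written back to global memory.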
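The tiling idea can be emulated in plain Python, as a rough sketch rather than a real kernel. The hypothetical `tile_offsets` helper combines a column of row offsets with a row of column offsets, which in Triton code is usually written with broadcasting as `offs_m[:, None] * stride_m + offs_n[None, :] * stride_n`. The `tiled_linear_relu` function is likewise illustrative: it assumes row-major flat lists, filters out-of-range indices in place of load masks, and uses a tiny `BLOCK` so the example stays readable.

```python
def tile_offsets(offs_m, offs_n, stride_m, stride_n):
    """2D block of flat offsets built from row offsets and column offsets."""
    return [[m * stride_m + n * stride_n for n in offs_n] for m in offs_m]


def tiled_linear_relu(a, b, bias, M, N, K, BLOCK=2):
    """Compute relu(A @ B + bias) one output tile at a time (A: M x K, B: K x N)."""
    c = [0.0] * (M * N)
    for m0 in range(0, M, BLOCK):
        for n0 in range(0, N, BLOCK):
            # Dropping out-of-range indices plays the role of a load/store mask.
            offs_m = [m for m in range(m0, m0 + BLOCK) if m < M]
            offs_n = [n for n in range(n0, n0 + BLOCK) if n < N]
            # The accumulator tile stays "hot" across the whole K loop.
            acc = [[0.0] * len(offs_n) for _ in offs_m]
            for k0 in range(0, K, BLOCK):
                offs_k = [k for k in range(k0, k0 + BLOCK) if k < K]
                a_tile = [[a[o] for o in row] for row in tile_offsets(offs_m, offs_k, K, 1)]
                b_tile = [[b[o] for o in row] for row in tile_offsets(offs_k, offs_n, N, 1)]
                # Each loaded A and B value is reused across the output tile.
                for i in range(len(offs_m)):
                    for j in range(len(offs_n)):
                        for kk in range(len(offs_k)):
                            acc[i][j] += a_tile[i][kk] * b_tile[kk][j]
            # Fused epilogue: bias add and ReLU happen before the single store.
            c_offs = tile_offsets(offs_m, offs_n, N, 1)
            for i in range(len(offs_m)):
                for j in range(len(offs_n)):
                    c[c_offs[i][j]] = max(acc[i][j] + bias[offs_n[j]], 0.0)
    return c


A = [1, 2, 3,
     4, 5, 6]          # 2 x 3
B = [1, -1,
     0, 1,
     -1, 0]            # 3 x 2
result = tiled_linear_relu(A, B, bias=[1.0, 0.5], M=2, N=2, K=3)
print(result)  # [0.0, 1.5, 0.0, 1.5]
```

Because the bias and ReLU are applied to the accumulator tile, the intermediate product `A @ B` is never written to or re-read from the output buffer, which is the saving that fusion is meant to capture.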

QUESTION 1

Why is an elementwise ReLU on a large matrix considered 'memory-bound'?

The ReLU function requires complex transcendental math.

The ratio of arithmetic operations to memory loads is very low (1:1).

Matrices are naturally stored in CPU memory only.

Triton cannot process non-linear activations.

QUESTION 2

What is the result of 'The Stride Trap' in production kernels?

The kernel runs significantly faster but with less precision.

Memory access violations or corrupted output due to incorrect address calculation on non-contiguous tensors.

The GPU automatically corrects the indexing using L2 cache.

The tensor is forced into a 1D shape by the compiler.

QUESTION 3

How does Triton represent a 2D tile of pointers?

By using a nested Python list of integers.

By broadcasting a 1D column vector and a 1D row vector of offsets together.

By launching multiple 1D kernels sequentially.

By allocating a special 2D register file.

QUESTION 4

Which operation benefits most from the O(N³) complexity shift to hide memory latency?

Vector Addition

Matrix Multiplication (GEMM)

Sigmoid Activation

Global Average Pooling

QUESTION 5

List three kernels in your current workflow that launch multiple PyTorch ops and might benefit from fusion.

Linear -> Bias -> ReLU; LayerNorm -> Dropout; Softmax -> Masking.

Print -> Log -> Sleep.

DataLoader -> Augmentation -> Storage.

These ops cannot be fused in Triton.